Smart Paper Notes

Structured information extraction from complex scientific text with fine-tuned large language models

Alexander Dunn , John Dagdelen , Nicholas Walker , Sanghoon Lee , Andrew S. Rosen , Gerbrand Ceder , Kristin Persson , Anubhav Jain

Category: Natural Language Processing

2022-12-10

Intelligently extracting and linking complex scientific information from unstructured text is a challenging endeavor, particularly for those inexperienced with natural language processing. Here, we present a simple sequence-to-sequence approach to joint named entity recognition and relation extraction for complex hierarchical information in scientific text. The approach leverages a pre-trained large language model (LLM), GPT-3, that is fine-tuned on approximately 500 pairs of prompts (inputs) and completions (outputs). Information is extracted either from single sentences or across sentences in abstracts/passages, and the output can be returned as simple English sentences or in a more structured format, such as a list of JSON objects. We demonstrate that LLMs trained in this way are capable of accurately extracting useful records of complex scientific knowledge for three representative tasks in materials chemistry: linking dopants with their host materials, cataloging metal-organic frameworks, and general chemistry/phase/morphology/application information extraction. This approach represents a simple, accessible, and highly flexible route to obtaining large databases of structured knowledge extracted from unstructured text. An online demo is available at http://www.matscholar.com/info-extraction.
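To make the prompt/completion idea concrete, the following is a minimal sketch of what one training pair for the dopant/host task might look like when the completion is a list of JSON objects. The example sentence, the field names (`host`, `dopants`), and the separator and stop strings are illustrative assumptions, not the schema used by the authors.

```python
import json

# Hypothetical training pair; sentence, field names, and the
# "###" / "END" markers are assumptions made for illustration only.
example = {
    "prompt": "ZnO thin films doped with Al were grown by sputtering.\n\n###\n\n",
    "completion": " " + json.dumps([{"host": "ZnO", "dopants": ["Al"]}]) + "\n\nEND",
}

def to_jsonl(pairs):
    """Serialize prompt/completion pairs, one JSON object per line."""
    return "\n".join(json.dumps(p, ensure_ascii=False) for p in pairs)

def parse_completion(text, stop="\n\nEND"):
    """Recover the structured records from a completion string."""
    body = text.split(stop)[0].strip()
    return json.loads(body)

records = parse_completion(example["completion"])
print(to_jsonl([example]))
print(records)
```

Keeping the completion as valid JSON means the model output can be parsed directly into database records, which is one plausible reason to prefer a structured format over free English sentences.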

Learning a Pedestrian Social Behavior Dictionary

Faith Johnson , Kristin Dana

Category: Computer Vision

2022-12-02

Understanding pedestrian behavior patterns is a key component to building autonomous agents that can navigate among humans. We seek a learned dictionary of pedestrian behavior to obtain a semantic description of pedestrian trajectories. Supervised methods for dictionary learning are impractical since pedestrian behaviors may be unknown a priori and the process of manually generating behavior labels is prohibitively time consuming. We instead utilize a novel, unsupervised framework to create a taxonomy of pedestrian behavior observed in a specific space. First, we learn a trajectory latent space that enables unsupervised clustering to create an interpretable pedestrian behavior dictionary. We show the utility of this dictionary for building pedestrian behavior maps to visualize space usage patterns and for computing the distributions of behaviors. We demonstrate a simple but effective trajectory prediction by conditioning on these behavior labels. While many trajectory analysis methods rely on RNNs or transformers, we develop a lightweight, low-parameter approach and show results comparable to SOTA on the ETH and UCY datasets.

Toward Human-AI Co-creation to Accelerate Material Discovery

Dmitry Zubarev , Carlos Raoni Mendes , Emilio Vital Brazil , Renato Cerqueira , Kristin Schmidt , Vinicius Segura , Juliana Jansen Ferreira , Dan Sanders

Category: Machine Learning | Artificial Intelligence

2022-11-05

There is an increasing need in our society to achieve faster advances in science to tackle urgent problems such as climate change, environmental hazards, sustainable energy systems, and pandemics. In certain domains like chemistry, scientific discovery carries the extra burden of assessing the risks of proposed novel solutions before moving to the experimental stage. Despite several recent advances in machine learning and AI that address some of these challenges, there is still a gap in technologies to support end-to-end discovery applications, integrating the myriad of available technologies into a coherent, orchestrated, yet flexible discovery process. Such applications need to handle complex knowledge management at scale, enabling knowledge consumption and production in a timely and efficient way for subject matter experts (SMEs). Furthermore, the discovery of novel functional materials strongly relies on the development of exploration strategies in the chemical space. For instance, generative models have gained attention within the scientific community due to their ability to generate enormous volumes of novel molecules across material domains. These models exhibit extreme creativity that often translates into low viability of the generated candidates. In this work, we propose a workbench framework that aims at enabling human-AI co-creation to reduce the time until the first discovery and the opportunity costs involved. This framework relies on a knowledge base with domain and process knowledge, and on user-interaction components to acquire knowledge and advise the SMEs. Currently, the framework supports four main activities: generative modeling, dataset triage, molecule adjudication, and risk assessment.

Learning Model Predictive Controllers with Real-Time Attention for Real-World Navigation

Xuesu Xiao , Tingnan Zhang , Krzysztof Choromanski , Edward Lee , Anthony Francis , Jake Varley , Stephen Tu , Sumeet Singh , Peng Xu , Fei Xia

Category: Robotics | Artificial Intelligence | Machine Learning

2022-09-22

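Despite decades of research, existing navigation systems still face real-world challenges when deployed in the wild, for example in cluttered home environments or in human-occupied public spaces. To address this, we present a new class of implicit control policies that combines the benefits of imitation learning with the robust handling of system constraints from Model Predictive Control (MPC). Our approach, called Performer-MPC, uses a learned cost function parameterized by vision context embeddings provided by Performers, a low-rank implicit-attention Transformer. We jointly train the cost function and construct the controller that relies on it, effectively solving the corresponding bi-level optimization problem end to end. We show that the resulting policy improves on standard MPC performance by leveraging a few expert demonstrations in different challenging real-world scenarios. Compared with a standard MPC policy, Performer-MPC achieves 40% better goal reaching in cluttered environments and more than 65% better social metrics when navigating around humans.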

Which Factors Drive Open Access Publishing? A Springer Nature Case Study

Fakhri Momeni , Stefan Dietze , Philipp Mayr , Kristin Biesenbender , Isabella Peters

Category: Machine Learning

2022-08-17

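Open Access (OA) facilitates access to articles. However, authors or funders often must pay the publishing costs, which prevents authors without financial support from participating in OA publishing and from sharing in the citation advantage of OA articles. OA may therefore exacerbate existing inequalities in the publication system rather than overcome them. To investigate this, we studied 522,664 articles published by Springer Nature. Employing statistical methods, we describe the relationship between authors affiliated with countries of different income levels, their choice of publishing model (OA or closed access), and the citation impact of their papers. A machine learning classification method helped us explore the association between OA publishing and attributes of the author (especially eligibility for APC waivers or discounts), the journal, the country, and the paper. The results indicate that authors eligible for APC waivers publish more in gold OA journals than other authors. In contrast, authors eligible for an APC discount have the lowest ratio of OA publications, which leads to the assumption that this discount does not sufficiently motivate authors to publish in a gold OA journal. The rank of a journal is a significant driver for publishing in a gold OA journal, whereas the OA option is mostly avoided in hybrid journals. Seniority, experience with OA publications, and the scientific field are the most decisive factors in OA publishing.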

Deep Learning for Material Decomposition in Photon-Counting CT

Alma Eguizabal , Ozan Öktem , Mats U. Persson

Category: Machine Learning

2022-08-05

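Photon-counting CT (PCCT) offers improved diagnostic performance through better spatial and energy resolution, but developing high-quality image reconstruction methods that can handle these large datasets is challenging. Model-based solutions incorporate models of the physical acquisition in order to reconstruct more accurate images, but they depend on an accurate forward operator and make it difficult to find a good regularization. Another approach is deep-learning reconstruction, which has shown great promise in CT. However, fully data-driven solutions typically need large amounts of training data and lack interpretability. To combine the benefits of both methods while minimizing their respective drawbacks, it is desirable to develop reconstruction algorithms that combine model-based and data-driven approaches. In this work, we present a novel deep-learning solution for material decomposition in PCCT, based on an unrolled/unfolded iterative network. We evaluate two cases: a learned post-processing, which implicitly utilizes model knowledge, and a learned gradient descent, which has explicit model-based components in the architecture. With our proposed techniques, we solve a challenging PCCT simulation case: three-material decomposition in abdominal imaging with low dose, iodine contrast, and a very small training sample support. In this scenario, our approach outperforms a maximum likelihood estimation, a variational method, and a fully learned network.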

Development and Validation of ML-DQA -- a Machine Learning Data Quality Assurance Framework for Healthcare

Mark Sendak , Gaurav Sirdeshmukh , Timothy Ochoa , Hayley Premo , Linda Tang , Kira Niederhoffer , Sarah Reed , Kaivalya Deshpande , Emily Sterrett , Melissa Bauer

Category: Machine Learning (Statistics) | Machine Learning

2022-08-04

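The approaches by which the machine learning and clinical research communities utilize real-world data (RWD), including data captured in the electronic health record (EHR), vary dramatically. While clinical researchers cautiously use RWD for clinical investigations, ML for healthcare teams consume public datasets with minimal scrutiny to develop new algorithms. This study bridges this gap by developing and validating ML-DQA, a data quality assurance framework grounded in RWD best practices. The ML-DQA framework is applied to five ML projects across two geographies, different medical conditions, and different cohorts. A total of 2,999 quality checks and 24 quality reports were generated on RWD gathered on 247,536 patients across the five projects. Five generalizable practices emerge: all projects used a similar method to group redundant data element representations; all projects used automated utilities to build diagnosis and medication data elements; all projects used a common library of rules-based transformations; all projects used a unified approach to assign data quality checks to data elements; and all projects used a similar approach to clinical adjudication. An average of 5.8 individuals, including clinicians, data scientists, and trainees, were involved in implementing ML-DQA for each project, and an average of 23.4 data elements per project were either transformed or removed in response to ML-DQA. This study demonstrates the important role of ML-DQA in healthcare projects and provides teams with a framework to conduct these essential activities.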

Accelerated and interpretable oblique random survival forests

Byron C. Jaeger , Sawyer Welden , Kristin Lenoir , Jaime L. Speiser , Matthew Segar , Ambarish Pandey , Nicholas M. Pajewski

Category: Machine Learning (Statistics)

2022-08-01

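The oblique random survival forest (RSF) is an ensemble supervised learning method for right-censored outcomes. Trees in the oblique RSF are grown using linear combinations of predictors to create branches, whereas in the standard RSF a single predictor is used. Oblique RSF ensembles often have higher prediction accuracy than standard RSF ensembles. However, assessing all possible linear combinations of predictors induces significant computational overhead that limits applications to large-scale data sets. In addition, few methods have been developed for interpreting oblique RSF ensembles, and they remain more difficult to interpret than their axis-based counterparts. We introduce a method to increase the computational efficiency of the oblique RSF and a method to estimate the importance of individual predictor variables with the oblique RSF. Our strategy to reduce computational overhead makes use of Newton-Raphson scoring, a classical optimization technique that we apply to the Cox partial likelihood function within each non-leaf node of decision trees. We estimate the importance of individual predictors for the oblique RSF by negating each coefficient used for the given predictor in linear combinations and then computing the resulting reduction in accuracy. In general benchmarking experiments, we find that our implementation of the oblique RSF is approximately 450 times faster than existing software for oblique RSFs, with a better Brier score. We find in simulation studies that "negation importance" discriminates between relevant and irrelevant predictors more reliably than permutation importance, Shapley additive explanations, and a previously introduced technique that measures variable importance with oblique RSFs based on analysis of variance. The methods introduced in the current study are available in the aorsf R package.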
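The negation idea can be illustrated on a toy problem. The sketch below is not the aorsf implementation and does not involve survival data or a forest: it assumes a single linear-combination split rule with hand-picked coefficients and plain classification accuracy, and all names (`predict`, `negation_importance`) are hypothetical. It only shows the mechanic of flipping one predictor's coefficient and measuring how much accuracy drops.

```python
def predict(coefs, row):
    """Oblique split rule: return 1 when the linear combination is positive."""
    return 1 if sum(c * x for c, x in zip(coefs, row)) > 0 else 0

def accuracy(coefs, rows, labels):
    """Fraction of rows whose predicted side matches the label."""
    hits = sum(predict(coefs, r) == y for r, y in zip(rows, labels))
    return hits / len(rows)

def negation_importance(coefs, rows, labels):
    """Accuracy lost when each coefficient is negated, one predictor at a time."""
    base = accuracy(coefs, rows, labels)
    scores = []
    for j in range(len(coefs)):
        flipped = list(coefs)
        flipped[j] = -flipped[j]
        scores.append(base - accuracy(flipped, rows, labels))
    return scores

# Toy data: the label depends only on the sign of the first predictor.
rows = [(x1, x2) for x1 in (-2, -1, 1, 2) for x2 in (-1, 1)]
labels = [1 if x1 > 0 else 0 for x1, _ in rows]
coefs = [1.0, 0.1]  # assumed coefficients: strong weight on x1, weak on x2

importance = negation_importance(coefs, rows, labels)
print(importance)
```

In this toy setup, negating the coefficient of the relevant predictor reverses every prediction, while negating the weakly weighted irrelevant predictor changes none, so the importance scores separate the two predictors cleanly.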

DNNShield: Dynamic Randomized Model Sparsification, A Defense Against Adversarial Machine Learning

Mohammad Hossein Samavatian , Saikat Majumdar , Kristin Barber , Radu Teodorescu

Category: Machine Learning

2022-07-31

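DNNs are known to be vulnerable to so-called adversarial attacks, which manipulate inputs to cause incorrect results that can benefit an attacker or harm the victim. Recent works have proposed approximate computation as a defense mechanism against machine learning attacks. We show that these approaches, while successful for a range of inputs, are insufficient to address stronger, high-confidence adversarial attacks. To address this, we propose DNNShield, a hardware-accelerated defense that adapts the strength of the response to the confidence of the adversarial input. Our approach relies on dynamic and random sparsification of the DNN model to achieve inference approximation efficiently and with fine-grained control over the approximation error. DNNShield detects adversarial inputs using the output distribution characteristics of sparsified inference. When applied to ResNet50, we show an adversarial detection rate of 86%, which exceeds the detection rate of state-of-the-art approaches, with lower overhead. We demonstrate a software/hardware-accelerated FPGA prototype, which reduces the performance impact of DNNShield relative to software-only CPU and GPU implementations.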
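As a rough illustration of random sparsification, the sketch below zeroes each weight of a single linear unit independently at random, repeats the inference several times, and reports the spread of the outputs. This is a toy under stated assumptions, not DNNShield itself: the weights, the input, the drop rates, and the function names are hypothetical, and the sketch stops short of any detection rule or hardware acceleration.

```python
import random
import statistics

def sparsified_output(weights, x, drop_rate, rng):
    """Dot product in which each weight is dropped with probability drop_rate."""
    return sum(w * xi for w, xi in zip(weights, x) if rng.random() >= drop_rate)

def output_spread(weights, x, drop_rate, runs=50, seed=0):
    """Population standard deviation of outputs over repeated sparsified runs."""
    rng = random.Random(seed)
    outs = [sparsified_output(weights, x, drop_rate, rng) for _ in range(runs)]
    return statistics.pstdev(outs)

# Assumed toy weights and input for a single linear unit.
weights = [0.5, -1.2, 0.8, 2.0]
x = [1.0, 0.5, -1.5, 0.25]

dense_spread = output_spread(weights, x, drop_rate=0.0)
sparse_spread = output_spread(weights, x, drop_rate=0.3)
print(dense_spread, sparse_spread)
```

With no weights dropped the output never varies, so the spread is zero; raising the drop rate increases the approximation error and therefore the spread, which is the kind of output statistic a detector could inspect.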

The MABe22 Benchmarks for Representation Learning of Multi-Agent Behavior

Jennifer J. Sun , Andrew Ulmer , Dipam Chakraborty , Brian Geuther , Edward Hayes , Heng Jia , Vivek Kumar , Zachary Partridge , Alice Robie , Catherine E. Schretter

Category: Machine Learning | Artificial Intelligence | Computer Vision

2022-07-21

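Real-world behavior is often shaped by complex interactions between multiple agents. To study multi-agent behavior at scale, advances in unsupervised and self-supervised learning have enabled a variety of behavioral representations to be learned from trajectory data. To date, there is no unified set of benchmarks that enables quantitative and systematic comparison of methods across a broad set of behavior analysis settings. We aim to address this by introducing a large-scale, multi-agent trajectory dataset from real-world behavioral neuroscience experiments that covers a range of behavior analysis tasks. Our dataset consists of trajectory data from common model organisms, with 9.6 million frames of mouse data and 4.4 million frames of fly data, in a variety of experimental settings, such as different strains, lengths of interaction, and optogenetic stimulation. A subset of the frames also includes expert-annotated behavior labels. Improvements on our dataset correspond to behavioral representations that work across multiple organisms and are able to capture differences for common behavior analysis tasks.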